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Abstract — Truncated Singular Value Decomposition (SVD) cal- 
culates the closest rank-/c approximation of a given input matrix. 
Selecting the appropriate rank k defines a critical model order 
choice in most applications of SVD. To obtain a principled 
cut-off criterion for the spectrum, we convert the underlying 
optimization problem into a noisy channel coding problem. 
The optimal approximation capacity of this channel controls 
the appropriate strength of regularization to suppress noise. 
In simulation experiments, this information theoretic method to 
determine the optimal rank competes with state-of-the art model 
selection techniques. 

I. Introduction 

Singular Value Decomposition (SVD) is a widely used 
technique for exploratory data analysis. It decomposes a 
given input matrix into a product of three matrices such 
that X = USV T . Thereby, U and V are unitary matrices 
and they essentially induce a rotation of the input data. S 
is a diagonal matrix (inducing a scaling) with the singular 
values as entries. Quite often, one is rather interested in an 
approximation of X instead of the exact decomposition such 
as, for instance, in Principal Components Analysis (PC A). 
For SVD, this approximation technically requires to set all 
but the first k diagonal entries in S to 0. The resulting 
reconstruction US/eV T has rank k. Neglecting all but the 
first k components is justified since the noise in the data 
perturbs the small eigenvalues, whereas the first k components 
supposedly capture the underlying structure or the signal of the 
data. Selecting the cutoff value k defines the central model- 
order selection problem of truncated SVD. 

In this paper we propose a novel method for selecting 
k which is based on the framework of approximation set 
coding (ASC) |T|. ASC defines the maximum approximation 
capacity (maxAC) principle for model-order selection. For a 
given model order and a given noisy dataset, ASC theory 
enables us to compute the capacity of a hypothetical channel. 
MaxAC selects the model that achieves highest capacity, i.e., 
the model of highest complexity that still can be robustly 
optimized in the presence of noise. Originally, max AC was 
derived for discrete optimization problems. So far, it was 
applied to decoding for the binary channel [ 1 ] and for selecting 
the number of clusters in clustering problems (2). In this 
contribution, we adopt it, for the first time, to a continuous 
optimization problem, namely for selecting the cut-off rank 
of truncated SVD. Thereby, we investigate the challenges 
of the continuous solution space. Moreover, we provide an 
experimental comparison with other model- selection methods 



indicating that our maxAC adaption to SVD can compete with 
state-of-the art methods. In the remainder of the paper we 
will first introduce the main concepts of ASC in Section [TT| 
In Section [Til] we then derive the approximation capacity for 
SVD, point to problems and solutions and report our numerical 
findings. 

II. Approximation Set Coding 

In this section we briefly recapitulate the principle of 
maximum approximation capacity (maxAC). Due to the page 
limit, we must refer to the original paper (H for a more 
detailed description. 

Consider a generic optimization problem with the input 
X = {xi, . . . ,xat} G X consisting out of TV measurements. 
The vector G M D identifies the i th measurement. Let 
the output of the optimization problem be the solution or 
hypothesis c G C where C is the set of all hypotheses 
satisfying the constraints of the problem, the hypothesis 
space. An optimization problem involves a cost function 
R : C x X — >■ R> that maps a hypothesis c to a real value 
R(c, X). Finally, let c x = arg min c R(c, X) be the hypothesis 
that minimizes costs. In order to rank all solutions of the 
optimizer, we introduce approximation weights 

w : C x X x R + [0, 1] , (1) 

(c,X,/3) i y w p (c,X) = exp(-PR(c,X)). 

These weights, parameterized by the inverse computational 
temperature /3, define the two weight sums Z\,Z<i, and the 
joint weight sum Z\i 



Z„ := 



^exp(-/3i?(c,xM)), i/ = 1,2 (2) 

cec 

Z 12 := ^exp(-/?(i?(c,XW) + i?(c,X( 2 )))), (3) 

cec 

where X^ 1 ^ and X^ 2 ^ are two random subsets of the input X 
and exp(-/3( J R(c,X^ 1 )) + R(c,X^))) measures how well 
a single solution c minimizes costs on both datasets. The 
sums p|3[ ) play a central role in ASC. If f3 = 0, all weights 
wp(c,X) = 1 are independent of the costs. In this case, 
Z v = \C\ indicates the size of the hypothesis space, and 
Z12 = Z\ = Z<i. For intermediate /3, Z(-) takes a value 
between and \C\, giving rise to the interpretation of Z(-) 
as the effective number of hypotheses that approximately fit 



the dataset whereas fj defines the precision of this 

approximation. 

ASC constructs a hypothetical coding scenario where sender 
and receiver have to communicate problem solutions via 
transformations r G T of noisy datasets X^ 1 ^ and X^ 2 ^. As 
derived in d, an asymptotically vanishing error rate for such 
a channel is achievable for rates 

'\{r s }\-Z 12 



N 



log 



(4) 



Eq.Q denotes the mutual information between the encoded 
transformation r s of the dataset and the decoded transforma- 
tion f . A higher approximation precision f3 has two effects. 
First, the hypothesis space is covered by more patches of ap- 
proximate solutions leading to more available codewords and 
a higher codeword entropy |{r s }|. Second, these patches have 
a higher overlap such that, in the decoding step, codewords are 
possibly confused. The maximum of Zp(r s ,T) with respect to 
f3 is called the approximation capacity. 

III. Approximation capacity of SVD 

To select the cut-off rank for SVD, we have to find the rank 
/c* with maximal approximation capacity. This search involves 
maximizing Eq.Q for all ranks that we investigate. By going 
through the list of all quantities introduced in the last section, 
we relate them to truncated SVD. The input of the problem 
is a matrix X G R 7Vxi:) . The cost function is the Frobenius 
norm: 



R(X, U, S, V) = ^2 I Xij - ^2 u it s ttVtj 



t=i 



An optimizer for this problem outputs a decomposition 
c-L(XM) = UMS^'VM, whereas all but the first k < 
mm(N,D) diagonal entries of are due to truncation 
(we will denote the third matrix by V instead of V T for 
convenience). The solution c x (X^^) gives the closest rank-/c 
approximation of X^) with respect to the Frobenius norm. In 
the case of SVD, a hypothesis c = (U, S,V) is a particular 
decomposition of the input matrix (in clustering, for instance, 
c is a relation that assigns objects to clusters Q). The k x D 
matrix with the entries w t j := s u vtj is a new basis and U 
provides the linear weights needed to represent the data X in 
this basis. When the empirical mean of X is the origin, this 
representation corresponds to PCA. 

We define the hypothesis space for truncated SVD with cut- 
off rank k as follows. For a fixed basis W, the hypothesis 
space is spanned by all TV x k matrices U. For a given dataset 
XM, U^) is the cost-minimizing solution. We parameterize 
the different transformations r G T for encoding messages 
in datasets by to = + U r W^. The identity 

transformation is Uid = 0. 

The mutual information Eq. ([4]) was originally derived 
for finite hypothesis spaces, such as for clustering solutions 
or binary message strings. The challenges of computing the 
weight sums ^ and ^ are twofold. First, in a small volume 
of a continuous hypothesis space, there are infinitely many 



transformations. Second, transformations can have infinite 
distance to each other such that the receiver can always 
distinguish an infinite subset of T even when the datasets are 
noisy. This makes the union bound in the derivation of the 
error-probability in [ 1 ], Eq. (7) inadequate. For these reasons, 
calculating the capacity under the assumption that U can 
be any real N x k matrix fails. In the following, we will 
first demonstrate the effect of this assumption by providing 
the naive analytical calculation of the mutual information in 
Eq. ([4]). Then we introduce constraints on the transformations 
T such that Eq. ^ can be computed. 

A. Unconstraint Hypothesis Space 

We abbreviate R v = i?(xW, U, S^, VM) and R A = 
1/2 ^(Xt^U^t 1 ), VW) + fl(X( 2 ),U,S( 1 ), V' 1 ')) and 
use the short-hand notation of mi as the i th row of a matrix 
M. By integrating over the space of all linear combinations 
U, we obtain the weight sums: 

/co 
exp [-f3R v ] dU , ve{l,2] (6) 
-CO 

= n ^ r e A& ™vy-*& dkui 

Z 12 = Texp [-PR*] dV = [Te-^B^r ) (7) 

J-OO ij 

s -/?/2(2(EN It ^0 2 - 2 K 1)+ ^ , )^". t ^ 1, ) dfeui 



The solution of Gaussian integrals yields 



/CO 
-co 



/2 Ef E?/ A M ,u t u t ,+£t fc b t u t / ( 27r ) l/2b T A- 



whereas in our case 



det(A) 



(8) 



A (A)_ A (i) 
The weight sums are 



(«)/ 7T 
%12 = 



\ 



(27r) fe 



dot ( 2/3 w^Vj 



M\VSi(^(i-w!? T ('!3Vf)-y?)) 



3 

NDk _ N 

n(»r) 1 

3 



(9) 
(10) 



We abbreviated := det(w l j M w 1 ^). In step (i), we 

used that for a n x n matrix M and a scalar p, it holds 
that det(pM) = p n det(M). In step (ii), we used that 



F W := w H( w H T w H)-i w H T = L substituting to 
Eq. l|4]> provides the mutual information: 



HP) = (log (Z 12 ) - log (Zi) - log (2a)) 



(11) 



Dk 



) _ T (2) 



The first order condition provides the optimal temperature 



(2) 



(12) 



Note that the temperature monotonically increases with the 
distance of the two datasets, as one would expect: with more 
noise, the precision for approximating the optimal solution 
must be lower. However, unexpectedly, the temperature de- 
creases with k suggesting that a higher rank stabilizes the 
solutions. This misconception is a consequence of the uncon- 
strained hypothesis space, as discussed earlier, and indicates 
that constraints for U are necessary. Also, we neglected the 
temperature-independent term \{r s }\ in Eq. ^ which would 
be infinitely high. 

B. Finite and Bounded Hypothesis space 

In the discussion above, we identified two problems: an 
infinitely large capacity due to i) an infinitely large transfor- 
mation space R 7Vxfe (or a negative one if we disregard the in- 
finitely many possible codewords) and ii) due to the existence 
of infinitely many transformations in an arbitrary small volume 
of this space. For a practical implementation of the max AC 
criterion for SVD we have to i) bound the hypothesis space 
and ii) constraint the density of transformations to a finite 
number. This renders the integrals for computing the weight 
sums to finite sums which must be explicitly computed. In our 
experiments, we use two ways of summing over the hypothesis 
space. First, the transformations populate the hypothesis space 
on an equispaced grid in a hypercube of finite size. Second, we 
randomly sample transformations from an isotropic Gaussian. 
In both cases the set of transformations is centered around 
the cost minimizing solution U^ 1 ^ (the identity transformation 
Uid). For both ways, one must choose the boundaries (the 
size of the grid or the variance of the Gaussian) as well as the 
number of transformations. 

We experimentally investigate the influence of these param- 
eters. First, we study the influence of the integration range 
on the capacity. We create data from a mixture of 4 

Gaussians with isotropic noise, leading to an optimal rank 4. 
We compute the approximation capacity by sampling transfor- 
mations from a Gaussian sphere around the cost-minimizing 
SVD solution with standard deviation a. Our experi- 

mental findings for various magnitudes of a are illustrated in 



(1,2) 



(1) 



Fig. j^J We write a in units of A = 1 /N Y2i=i 
whereas U (1 ' 2) is the matrix that satisfies X^ 2 ^ = U (1 ' 2) W w . 
In the regime where 1/N ||U r || « A, a transformed dataset 
Tj o X^ 2 ) could possibly be confused with X.^\ When the 



transformations are smaller than A, none of the transforma- 
tions could possibly be used in a codebook as they are all 
indistinguishable from the identity transformation. As a result, 
the obtained capacity is too low. On the contrary, for a too 
high integration range, the capacity converges to the naive 
analytical solution because infinitely many transformations 
could serve as distinguishable codewords. 
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Fig. 2. Numerically computed approximation capacity for various sizes of 
the subspace containing the transformations. 

The second experiment studies the influence of the number 
of transformations on the mutual information. This time, 
we use a grid of fixed size and vary the density of grid 
points. In order to sufficiently cover the hypothesis space, we 
increase the number of transformations by a factor of 2 when 
increasing k. While this increment is still too low to preserve 
the transformation density, it already imposes a computational 
challenge. In our experiments with larger datasets, we sample 
the hypothesis space more sparsely. The influence of the 
number of transformations is illustrated in Fig. [T] The results 
demonstrate that this number only affects the stability of the 
computation and not the maximum of the capacity. 

C. Continuous and bounded hypothesis space 

The numerical experiment on the influence of the number 
of transformations suggests that for a defined transformation 
density, the analytical solution should provide the desired 
result if only the integration range is properly defined. We 
calculate the mutual information as in Section \U1-A\ but, this 
time, we weight the integrand with an isotropic Gaussian 
around the identity u\l\v i,t to suppress the contribution of 



heavy transformations. 

■/ 

J — c 



-I3R U 



- dU, v e {1,2} (13) 



-12 : 



-PR* 



T,i,t( u it-u* it ) 2 dU 



(14) 



The derivation of the mutual information is provided in Ap- 
pendix B. The simulations depicted in Fig. [3] illustrate that for 
an analytically computed mutual information (see Eq. (|40|), 



the width a influences the capacity much more than in the 
numerical computation (compare with Fig. [3]). However, if a 
maximum exists (square markers in Fig. [3]), it is at the correct 
rank. 
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Fig. 1. Approximation capacity against rank for various numbers of sample points. The optimal rank is k = 4. Even though the individual trends still vary 
a lot for very few sample points, the optimal model-order is already found. With increasing number of transformations the calculations are stabilized. 
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Fig. 3. Analytically computed approximation capacity integrated over an 
isotropic Gaussian sphere with varying standard deviation. 



D. Comparison with other model- selection techniques 

We study how well maxAC and other model-order selec- 
tion methods select the appropriate rank for approximating a 
noisy dataset via rank- limited SVD. A comparison with the 
following methods is performed: 'Laplace' and 'BIC' approx- 
imate the marginal likelihood (the evidence) for probabilistic 
PCA [3]. The first method applies Laplace approximation to 
the involved integral. The well-known BIC score (4) further 
simplifies that the likelihood exhibits the same sensitivity to 
all model parameters. A third method, the minimum transfer 
cost principle ('MTC') |5], mimics cross-validation. It learns 
a rank-limited SVD on one random subset of the data and then 
computes the costs when applying it to another one. Thereby, 
the model parameters are transfered by a mapping function 
ijj which maps each object in the first dataset to its nearest 
neighbor in the second dataset. 

We create objects from a number of centroids with a defined 
separation from each other and add Gaussian noise. The 
difficulty of the problem is controlled by altering the variance 
relative to the separation of the centroids. When going to a 
higher number of centroids, the dimensionality must also be 
increased to preserve their separation. To enable a comparison 
with the PCA methods, the data mean is shifted to the origin. 

For a given true number of generating components and a 
given noise level relative to the centroid separation, there exists 
one SVD rank that yields the reconstruction with the minimal 
deviation from the noise-free matrix. For very noisy data or 
a high number of components and dimensions, this optimally 
denoising rank is smaller than the true number of generating 
components. Inspecting Fig. [4] one can see that all methods 
select a rank between the best denoising rank and the true 



rank. For low noise (Fig. 4(a)) learning the rank is easy. For 
a high noise level, the learning problem becomes hard when 
the number of generating components and the dimensionality 
increase. There exists a transitional regime where all methods 



start selecting a lower rank than the true one (Fig. 4(b)). For 
very high noise levels, all methods select k = 1 (Fig. 4(c)[). 



IV. Conclusion 

We proposed a novel technique for selecting the cut-off 
rank of truncated SVD. Our criterion selects the rank that 
maximizes the approximation capacity (maxAC) and, thereby, 
captures the maximum amount of information in the data that 
can by reliably inferred from random subsets of the input 
dataset. We demonstrated, for the first time, how to apply 
maxAC to an optimization problem with a continuous solution 
space like SVD. The Euclidean geometry of the parameter 
space renders the computation of the approximation capacity 
more difficult than in the clustering case with random permu- 
tations |2|. We discussed the challenges with such problem 
domains and proposed solutions. Finally, we demonstrated in 
comparative experiments that the model-order selection of our 
technique conforms with established methods. Future work 
will address the application of the max AC criterion to other 
continuous optimization problems such as, for instance, sparse 
linear regression. 
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Appendix A: attaining capacity 

There are several ways of numerically maximizing the 
mutual information in Eq. [4] with respect to the temperature. 
The simplest way is to compute Ip for several values of f3 
and pick the maximum. This creates a quantization error of 
the optimum that depends on the step-size of the temperature 
scale. Moderate step-sizes already lead to good results as Ip 
is usually flat around its maximum. 

Using a gradient-descent method like Newton iterations 
provide more precise results . In the following, we report the 
first and the second derivation of Ip which are needed for the 
Newton updates. The derivation of the mutual information Ip 
(Eq. [4]) with respect to the inverse temperature f) is 



am 

dp 

=1/N 



^l g(|AC 7 |)-^log(|cM|)j (15) 



1 d 1 1 1 d 



|AC 7 | dp 



7 



(16) 



=1 iJ ^-^ exp [ ~ pRv] dlJ \ - G^ exp [ ~ PRa] dlJ N 

=1/N (j^ E IR»} PG (R„) - E[R A } PG(R ^ J (17) 

Where pg(Ra) = Z~ x exp(— (3Ra) is the Gibbs distribution 
with normalization constant Z = J^exp [— /3Ra] dU. These 
expectations can easily be computed either for a finite set of 
transformations or with a continuous integral. 
Accordingly, the second derivative is: 

d 2 I{p) =UN (^ /9 j->exp[-R]rfU \ 
an#Aexp[-/3i? A ]dU\ 



3fi /^exp \-PR A ] dV 



(18) 



V/N (E[R v ] 2 pG(Rv) - E[R% G{R ^j 

- 1/N (E[i?|] PG(i?A) - E[R a ] 2 pg{r&) ) (19) 



Appendix B: Analytical Calculation with bounded integration range 

We derive the mutual information Eq. ( |40| > when the transformations are weighted with a Gaussian centered around the 
identity transformation u\l' . Except for this modification the derivation is analog to the derivation of the unconstraint mutual 
information in Eq. (fTT). The approximation sets and the joint approximation set are 



and 



I AC, 



<^7 



= J°° exp (-^(XM,U, SM.VM)) exp ^- A. - u^) 2 j <iU , v G {1, 2} 

= n-p(-^ ( ; )2 )ex P (-^E< 2 ) 



(20) 
(21) 
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■ exp [ -2/3 ^ Uit ( u? t ( J )2 «it + 2w t ( ^ ^ u iV w^] 
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Uit 



d k Ui. 



j°° exp (-0/2 (i?(xW,U,S( 1 ),V( 1 )) + i?(X( 2 ),U,S( 1 ),V( 1 ))))exp ^ £> it - v£>? j dU (22) 
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/oo I k I k 

exp -p ^2 u n w tj )2u it + 2^ 1} 

■exp (k? +»g ) H 1) - <*V • 



(23) 
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(a) Low relative noise 
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(b) High relative noise 
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Fig. 4. Rank-k approximation for a mixture of Gaussians. In (a) and (b) the number of generating components varies, in (c) the noise level varies for 3 
components. For readability, we omitted the variances in (c). They are comparable with (b). All methods select a rank between the true number of components 
and the rank that minimizes the distance to the noise-free matrix ('Best denoising'). 



The characteristic terms of the integrals are 



b (iJ) ~ ZP X iJ W J - Ui 
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These terms determine the cardinalities of the approximation sets: 
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The number of random transformations depends on the size of the Gaussian sphere b : 
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Define the auxiliary variables := w y w j . Then, the mutual information i 

m = l/iV^Kr.H+logdA^D-^logd^l)^ 
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